 image plane


Appendices and Supplementary Material

Neural Information Processing Systems

A.1 Coordinate Systems and Transformation

To achieve spatial synchronization between different sensors, vehicle-vehicle-UAV collaboration requires using sensor parameter information to perform coordinate system transformations. The relationships between the coordinate systems are illustrated in Fig. S1.

Figure 1: Relationship between coordinate systems.

The pixel coordinate system is a two-dimensional coordinate system defined on the image plane, typically represented as (u, v), with units in pixels. In this system, the origin is located at the top-left corner of the image, the u-axis points to the right along the horizontal direction, and the v-axis points downward along the vertical direction. This coordinate system is used to describe the position of points on the two-dimensional image captured by the camera.
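As a rough illustration of how a point ends up in the pixel coordinate system described above, the following sketch projects a 3D point given in camera coordinates to (u, v) with a standard pinhole model. The function name, the intrinsic parameters (`fx`, `fy`, `cx`, `cy`), and the camera-frame convention (x right, y down, z forward) are assumptions made for this example; the excerpt does not state which convention or parameters the paper uses.

```python
def camera_to_pixel(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v).

    Assumed convention (not taken from the source): x points right, y points
    down, z points forward, so u grows to the right and v grows downward,
    matching a pixel origin at the top-left corner of the image.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return u, v


# Hypothetical intrinsics for a 640x480 image.
u, v = camera_to_pixel((0.5, -0.25, 2.0), fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(u, v)  # 520.0 140.0
```

A point to the right of and above the optical axis lands right of and above the principal point (cx, cy), which is consistent with the v-axis pointing downward.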


Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Neural Information Processing Systems

People continuously perceive and interact with their surroundings based on underlying intentions that drive their exploration and behaviors. While research in egocentric user and scene understanding has focused primarily on motion and contact-based interaction, forecasting human visual perception itself remains less explored despite its fundamental role in guiding human actions and its implications for AR/VR and assistive technologies. We address the challenge of egocentric 3D visual span forecasting, predicting where a person's visual perception will focus next within their three-dimensional environment. To this end, we propose EgoSpanLift, a novel method that transforms egocentric visual span forecasting from 2D image planes to 3D scenes. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry and extracts volumetric visual span regions. We further combine EgoSpanLift with 3D U-Net and unidirectional transformers, enabling spatio-temporal fusion to efficiently predict future visual span in the 3D grid.


Flare7K: A Phenomenological Nighttime Flare Removal Dataset (Supplementary Material)

Neural Information Processing Systems

In this supplementary material, we present additional details of the proposed Flare7K dataset and experimental settings and show more results.

Figure 1: Illustration of a simplified lens system. In the lens and aperture plane, the light passes through the dirty aperture and lens system, leaving a scattering flare on the image plane.

In this section, we use a simplified Fourier optics model to illustrate how different kinds of scattering flares occur. A basic lens system can be viewed as a combination of one convex lens, one aperture, and an image plane, as shown in Figure 1. We set the optical center as the origin of a coordinate system. Then, the light source's position is $(x_0, y_0, z_0)$. It is a combination of an aperture function $\tilde{A}_\lambda(x, y)$ and a lens function $\tilde{T}_L(x, y)$. Suppose the focal length of the lens is $f$ and the lens is ideal. After adjusting the origin of $x_1$ and $x_2$, Equation (11) can be viewed as a standard Fourier transform. Thus, the point spread function (PSF), which is the squared amplitude of the optical field on the image plane, can be written as $\mathrm{PSF}_\lambda = |\mathcal{F}\{\tilde{A}_\lambda(x, y)\}|^2$. Since stains with depth may introduce a phase shift in the aperture function, $\mathrm{PSF}_\lambda$ may vary with the wavelength $\lambda$ of the light source.
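The relation between the aperture function and the PSF can be sketched numerically. The example below is a deliberately simplified, one-dimensional illustration, not the paper's model: it uses a plain discrete Fourier transform, treats a stain as an opaque block in the aperture (amplitude only, ignoring the phase shift mentioned above), and ignores wavelength. All names and the aperture layout are hypothetical.

```python
import cmath


def dft(signal):
    """Naive 1D discrete Fourier transform (O(n^2), fine for small inputs)."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def psf_from_aperture(aperture):
    """PSF as the squared magnitude of the aperture's Fourier transform,
    normalized so that the PSF sums to 1."""
    power = [abs(c) ** 2 for c in dft(aperture)]
    total = sum(power)
    return [p / total for p in power]


# Hypothetical 1D aperture: an 8-sample opening in a 32-sample plane.
clean = [1.0 if 12 <= k < 20 else 0.0 for k in range(32)]

# Hypothetical "stain": two samples in the middle of the opening are blocked.
dirty = list(clean)
dirty[15] = 0.0
dirty[16] = 0.0

psf_clean = psf_from_aperture(clean)
psf_dirty = psf_from_aperture(dirty)

# Index 0 is the zero-frequency term, i.e. the center of the PSF.
print(round(psf_clean[0], 4), round(psf_dirty[0], 4))  # 0.25 0.1875
```

In this toy setting the stained aperture puts a smaller fraction of the energy at the center of the PSF and more in the side lobes, which is the qualitative behavior one would expect from a scattering flare.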




Ground Plane Projection for Improved Traffic Analytics at Intersections

arXiv.org Artificial Intelligence

Accurate turning movement counts at intersections are important for signal control, traffic management and urban planning. Computer vision systems for automatic turning movement counts typically rely on visual analysis in the image plane of an infrastructure camera. Here we explore potential advantages of back-projecting vehicles detected in one or more infrastructure cameras to the ground plane for analysis in real-world 3D coordinates. For single-camera systems we find that back-projection yields more accurate trajectory classification and turning movement counts. We further show that even higher accuracy can be achieved through weak fusion of back-projected detections from multiple cameras. These results suggest that traffic should be analyzed on the ground plane, not the image plane.
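To make the idea of back-projection concrete, here is a minimal sketch that maps a pixel to a point on a flat ground plane by intersecting the viewing ray with the plane. It assumes a pinhole camera with known intrinsics, mounted at a known height and pitched downward by a known angle with no roll or yaw. These assumptions, the function name, and the parameter names are illustrative only; the abstract does not describe the paper's calibration or projection procedure.

```python
import math


def pixel_to_ground(u, v, fx, fy, cx, cy, height, pitch_rad):
    """Back-project pixel (u, v) to ground-plane coordinates (X, Y).

    Assumed setup (hypothetical): world X is to the right, Y is forward along
    the ground, Z is up; the camera sits at (0, 0, height) and is pitched
    downward by pitch_rad. Camera axes are x right, y down, z forward.
    """
    # Ray direction in camera coordinates (pinhole model).
    dx = (u - cx) / fx
    dy = (v - cy) / fy

    # Rotate the ray into world coordinates.
    s, c = math.sin(pitch_rad), math.cos(pitch_rad)
    dir_x = dx
    dir_y = c - dy * s
    dir_z = -s - dy * c

    if dir_z >= 0:
        raise ValueError("ray points at or above the horizon; no ground intersection")

    # Intersect with the ground plane Z = 0.
    t = -height / dir_z
    return t * dir_x, t * dir_y


# Camera 10 m above the ground, pitched 45 degrees down.
X, Y = pixel_to_ground(320.0, 240.0, fx=800.0, fy=800.0, cx=320.0, cy=240.0,
                       height=10.0, pitch_rad=math.radians(45))
print(round(X, 6), round(Y, 6))
```

With these numbers the principal point maps to a spot about 10 m in front of the camera. Once detections live in such ground coordinates, distances and headings are in metric units rather than pixels, which is the kind of representation the abstract argues for.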




Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion

arXiv.org Artificial Intelligence

Abstract: Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we propose a multi-modal light field point-cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.

As a fundamental task in computer vision, semantic segmentation is crucial for a wide range of applications, including autonomous driving [1], road detection [2], and medical image processing [3]. Existing semantic segmentation methods can be divided into image-based semantic segmentation [4]-[17] and LiDAR-point-cloud-based semantic segmentation [18]-[25] according to the type of input data.


3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks

arXiv.org Artificial Intelligence

RGB-based 3D tasks, e.g., 3D detection, depth estimation, and 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry, achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On the SUN RGB-D dataset, 3DRot raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11. As a comparison, Cube R-CNN, which adds 3 other datasets to SUN RGB-D for monocular 3D estimation and uses a similar mechanism and test dataset, increases $IoU_{3D}$ from 36.2 to 37.8 and boosts $mAP_{0.5}$ from 34.7 to 35.4. Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.
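The claim that a rotation about the optical center needs no scene depth follows from a standard result in projective geometry: for a pure camera rotation R with intrinsics K, pixels move according to the homography H = K R K^{-1}, while 3D annotations in the camera frame are simply rotated by R. The sketch below checks this consistency on one point. It illustrates that general identity under assumed intrinsics and an assumed rotation axis; it is not a reproduction of the 3DRot implementation, and all helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, x):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Pinhole intrinsic matrix K and its closed-form inverse."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return k, k_inv


def rotation_y(angle_rad):
    """Rotation about the camera's y-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(k, point):
    """Project a 3D camera-frame point to pixel coordinates."""
    p = matvec(k, point)
    return p[0] / p[2], p[1] / p[2]


def warp_pixel(h, u, v):
    """Apply a homography to a pixel."""
    q = matvec(h, [u, v, 1.0])
    return q[0] / q[2], q[1] / q[2]


K, K_inv = intrinsics(500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics
R = rotation_y(math.radians(10))                   # hypothetical augmentation
H = matmul(matmul(K, R), K_inv)                    # depth-free image warp

point = [0.4, -0.2, 3.0]                           # a 3D annotation
u, v = project(K, point)                           # its original pixel

# Path 1: rotate the 3D annotation, then project.
u_direct, v_direct = project(K, matvec(R, point))
# Path 2: warp the original pixel with the homography.
u_warp, v_warp = warp_pixel(H, u, v)

print(abs(u_direct - u_warp) < 1e-6, abs(v_direct - v_warp) < 1e-6)  # True True
```

Both paths land on the same pixel, and the warp itself never uses the point's depth, which is why the image and the 3D labels can be updated together in a geometrically consistent way.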